We want to identify the most valuable director and the most valuable actor. To quantify "valuable", we use two variables: IMDB score and total profit. We selected directors and actors with an IMDB score higher than 6.0 and plotted the score on the x axis against total profit on the y axis. The two plots show intriguing results. A few directors and actors have very high IMDB scores but earned low profits; they are indicated by blue dots. However, a considerable number of directors and actors have both a high score and a high profit; they are marked with red dots. Among the directors we can spot famous names such as James Cameron, Christopher Nolan and Steven Spielberg.
We can also find popular actors and actresses in the plot, such as Scarlett Johansson, Leonardo DiCaprio and Brad Pitt. Thus, combining IMDB score and total profit gives an intuitive picture of the profitability of directors and actors. Although further analysis needs to be conducted to obtain more detailed and reliable results, these two plots give a preliminary answer about the most valuable director and actor.
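The aggregation behind these plots can be sketched as follows. This is a minimal Python illustration, not the code used for the report: the rows are made up, profit is assumed to be gross minus budget, the score per person is assumed to be the mean over their movies, and the sign of total profit is used as a hypothetical cut-off between red and blue dots.

```python
from collections import defaultdict

# Hypothetical rows: (director_name, imdb_score, gross, budget).
movies = [
    ("Director A", 8.1, 500e6, 200e6),
    ("Director A", 7.5, 300e6, 150e6),
    ("Director B", 8.4, 5e6, 10e6),
    ("Director C", 5.2, 90e6, 30e6),
]


def summarize(rows, min_score=6.0):
    """Return {name: (mean_score, total_profit)} for names whose mean score exceeds min_score."""
    scores = defaultdict(list)
    profit = defaultdict(float)
    for name, score, gross, budget in rows:
        scores[name].append(score)
        profit[name] += gross - budget  # assumed definition of profit
    result = {}
    for name, values in scores.items():
        mean_score = sum(values) / len(values)
        if mean_score > min_score:
            result[name] = (mean_score, profit[name])
    return result


summary = summarize(movies)
for name, (score, total) in sorted(summary.items()):
    colour = "red" if total > 0 else "blue"  # hypothetical cut-off for "high profit"
    print(f"{name}: mean score {score:.2f}, total profit {total:,.0f} -> {colour}")
```

The same function applies to actors by passing rows keyed on an actor name instead of a director name.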
| color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | actor_1_name | movie_title | num_voted_users | cast_total_facebook_likes | actor_3_name | facenumber_in_poster | plot_keywords | movie_imdb_link | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:4846 | Length:4846 | Min. : 1.0 | Min. : 7.0 | Min. : 0 | Min. : 0.0 | Length:4846 | Min. : 0.0 | Min. :0.000e+00 | Length:4846 | Length:4846 | Length:4846 | Min. : 5 | Min. : 0 | Length:4846 | Min. : 0.000 | Length:4846 | Length:4846 | Min. : 1.0 | Length:4846 | Length:4846 | Length:4846 | Min. :1.100e+03 | Min. :1916 | Min. : 0 | Min. :1.600 | Min. : 1.180 | Min. : 0 | |
| Class :character | Class :character | 1st Qu.: 50.0 | 1st Qu.: 94.0 | 1st Qu.: 7 | 1st Qu.: 130.5 | Class :character | 1st Qu.: 607.5 | 1st Qu.:4.197e+06 | Class :character | Class :character | Class :character | 1st Qu.: 8351 | 1st Qu.: 1394 | Class :character | 1st Qu.: 0.000 | Class :character | Class :character | 1st Qu.: 64.0 | Class :character | Class :character | Class :character | 1st Qu.:5.000e+06 | 1st Qu.:1999 | 1st Qu.: 277 | 1st Qu.:5.800 | 1st Qu.: 1.850 | 1st Qu.: 0 | |
| Mode :character | Mode :character | Median :110.0 | Median :103.0 | Median : 48 | Median : 365.0 | Mode :character | Median : 984.0 | Median :2.769e+07 | Mode :character | Mode :character | Mode :character | Median : 33520 | Median : 3075 | Mode :character | Median : 1.000 | Mode :character | Mode :character | Median : 155.0 | Mode :character | Mode :character | Mode :character | Median :1.800e+07 | Median :2005 | Median : 593 | Median :6.600 | Median : 2.350 | Median : 158 | |
| NA | NA | Mean :139.6 | Mean :107.9 | Mean : 692 | Mean : 632.0 | NA | Mean : 6566.9 | Mean :8.623e+07 | NA | NA | NA | Mean : 83232 | Mean : 9665 | NA | Mean : 1.371 | NA | NA | Mean : 269.6 | NA | NA | NA | Mean :3.299e+07 | Mean :2002 | Mean : 1635 | Mean :6.422 | Mean : 2.152 | Mean : 7342 | |
| NA | NA | 3rd Qu.:193.0 | 3rd Qu.:118.0 | 3rd Qu.: 190 | 3rd Qu.: 635.0 | NA | 3rd Qu.: 11000.0 | 3rd Qu.:9.396e+07 | NA | NA | NA | 3rd Qu.: 94279 | 3rd Qu.: 13740 | NA | 3rd Qu.: 2.000 | NA | NA | 3rd Qu.: 322.8 | NA | NA | NA | 3rd Qu.:4.000e+07 | 3rd Qu.:2011 | 3rd Qu.: 912 | 3rd Qu.:7.200 | 3rd Qu.: 2.350 | 3rd Qu.: 2000 | |
| NA | NA | Max. :813.0 | Max. :511.0 | Max. :23000 | Max. :23000.0 | NA | Max. :640000.0 | Max. :2.784e+09 | NA | NA | NA | Max. :1689764 | Max. :656730 | NA | Max. :43.000 | NA | NA | Max. :5060.0 | NA | NA | NA | Max. :4.200e+09 | Max. :2016 | Max. :137000 | Max. :9.500 | Max. :16.000 | Max. :349000 | |
| NA | NA | NA's :46 | NA's :14 | NA's :39 | NA's :23 | NA | NA's :7 | NA's :158 | NA | NA | NA | NA | NA | NA | NA's :13 | NA | NA | NA's :20 | NA | NA | NA | NA's :96 | NA's :43 | NA's :13 | NA | NA's :322 | NA | |
| Genres | Number of Movies | Weights in All Genres | Mean Gross in 2001 | Mean Gross in 2002 | Mean Gross in 2003 | Mean Gross in 2004 | Mean Gross in 2005 | Mean Gross in 2006 | Mean Gross in 2007 | Mean Gross in 2008 | Mean Gross in 2009 | Mean Gross in 2010 | Mean Gross in 2011 | Mean Gross in 2012 | Mean Gross in 2013 | Mean Gross in 2014 | Mean Gross in 2015 | Mean Gross in 2016 | Mean Gross | Mean Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Action | 1099 | 7.91 | 7.5113359 | 5.0348338 | 3.6659315 | 6.1610049 | 5.3109100 | 9.2933920 | 9.3247616 | 6.9523944 | 11.1307750 | 9.9881282 | 8.6021768 | 13.8992884 | 10.7120841 | 9.0187928 | 15.4251190 | 9.5714290 | 5.0801528 | 5.7153813 |
| Adventure | 878 | 6.32 | 5.5362595 | 6.6593737 | 7.0536493 | 6.9766441 | 7.3259330 | 7.3137728 | 11.9040353 | 10.1560429 | 9.7798759 | 13.0022783 | 13.1423075 | 14.5044585 | 12.0745704 | 8.7810398 | 21.3889007 | 14.1425232 | 4.8394623 | 6.3247249 |
| Animation | 233 | 1.68 | 1.2743054 | 2.6344065 | 1.4076387 | 2.6671145 | 3.4189402 | 1.9859662 | 3.7050303 | 4.1424675 | 3.9280398 | 3.7226985 | 4.9816557 | 3.4491392 | 4.3410987 | 1.8819536 | 24.6721281 | 16.3805887 | 1.4992517 | 1.1446435 |
| Biography | 290 | 2.09 | 0.2482900 | 0.8268890 | 0.5006664 | 0.6655225 | 0.7232949 | 0.2832208 | 0.9184040 | 0.6639806 | 0.4592782 | 0.5509327 | 1.6751161 | 1.5527428 | 1.2447248 | 0.1715944 | 5.6671196 | 3.2204343 | 0.5458465 | 0.4278311 |
| Comedy | 1817 | 13.08 | 6.3135925 | 6.2782505 | 5.5654253 | 7.0248587 | 6.6313673 | 6.5561736 | 7.4834405 | 8.0553552 | 8.8167848 | 8.5377880 | 8.3677087 | 7.2378317 | 7.2213983 | 5.4196184 | 8.1840969 | 5.2999010 | 5.2076130 | 5.2855890 |
| Crime | 850 | 6.12 | 2.2275215 | 1.7363677 | 1.8420637 | 2.0705889 | 1.9190191 | 2.5618709 | 1.8452078 | 1.3744541 | 2.8351667 | 1.9909499 | 2.9746719 | 2.3618794 | 3.0741678 | 0.9705592 | 6.1597263 | 2.7949582 | 2.7619023 | 2.3630079 |
| Documentary | 121 | 0.87 | 0.0303067 | 0.2581488 | 0.1063518 | 0.1308012 | 0.0655831 | 0.0624536 | 0.1779249 | 0.1967070 | 0.1319757 | 0.0707202 | 0.0671857 | 0.0001423 | 0.0000855 | 0.0000000 | 1.3887833 | 0.9644173 | 0.0428024 | 0.1569546 |
| Drama | 2484 | 17.88 | 5.1757259 | 4.9022803 | 4.8857020 | 6.1129869 | 4.1574639 | 8.0350153 | 5.8787095 | 5.4907725 | 4.9054520 | 6.9132858 | 6.0174162 | 8.2031323 | 7.0748358 | 2.7122157 | 5.7746380 | 3.2301933 | 4.6359725 | 6.1435429 |
| Family | 522 | 3.76 | 2.7756047 | 4.6443551 | 4.6498030 | 4.3013900 | 5.0755897 | 4.7462867 | 5.7610736 | 7.2129003 | 5.3168796 | 5.1230471 | 5.6830006 | 5.0151444 | 5.3073983 | 3.2243129 | 18.1183430 | 12.3750335 | 2.4679865 | 3.3040821 |
| Fantasy | 571 | 4.11 | 3.4203568 | 3.9765935 | 3.1728971 | 4.9017013 | 5.6133112 | 3.9915152 | 7.4471926 | 8.5155584 | 5.7441466 | 7.2807121 | 7.9931934 | 6.4181920 | 3.8406592 | 2.3744208 | 17.9812686 | 11.5767160 | 4.2649923 | 4.2234106 |
| Film-Noir | 6 | 0.04 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.5300000 | 0.3842400 | 0.0000000 | 0.0000000 |
| Game-Show | 1 | 0.01 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 3.6882378 | 1.5882378 | 0.0000000 | 0.0000000 |
| History | 200 | 1.44 | 1.0040606 | 0.3891286 | 0.4285096 | 0.5665936 | 0.3663727 | 0.6388674 | 0.4077878 | 0.5642908 | 0.1483307 | 0.6916353 | 0.3267927 | 0.8188780 | 1.0108431 | 0.2102009 | 6.5065776 | 2.9859592 | 0.7063565 | 0.2647965 |
| Horror | 535 | 3.85 | 0.8555357 | 1.2240064 | 1.4101534 | 0.9690721 | 1.3558102 | 1.2304313 | 0.9386224 | 1.2128496 | 0.9710811 | 1.6731958 | 1.8966377 | 1.0272911 | 0.6957171 | 0.7508504 | 4.9354089 | 3.1503099 | 0.4358286 | 0.5607328 |
| Music | 209 | 1.50 | 0.4393631 | 0.3409285 | 0.4099673 | 0.7991010 | 0.9363650 | 0.4176639 | 0.7821179 | 0.4671639 | 0.4715517 | 0.3729657 | 0.2861659 | 0.2257222 | 0.8693450 | 0.0443235 | 5.2228877 | 3.4092869 | 0.2268857 | 0.3840162 |
| Musical | 129 | 0.93 | 0.2010248 | 0.2202015 | 0.1050264 | 0.1684012 | 0.5726231 | 0.9350985 | 0.4115464 | 0.6981324 | 0.7628084 | 0.2108741 | 1.3509458 | 0.9818237 | 0.0000000 | 0.0000000 | 9.7048240 | 6.8209531 | 0.0644281 | 0.3525323 |
| Mystery | 469 | 3.38 | 0.8753704 | 2.9847455 | 2.6060465 | 2.0650152 | 2.4663044 | 1.1278823 | 3.0337066 | 1.0664908 | 2.4473679 | 1.1820572 | 1.3231570 | 1.9764144 | 0.9639744 | 0.9221128 | 8.2563691 | 5.0673599 | 0.7971685 | 2.3881900 |
| News | 3 | 0.02 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0191215 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0007409 | 0.0000000 | 0.0000000 | 0.0000000 | 0.6620821 | -0.0983512 | 0.0000000 | 0.0000000 |
| Reality-TV | 1 | 0.01 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 3.6882378 | 1.5882378 | 0.0000000 | 0.0000000 |
| Romance | 1069 | 7.69 | 2.2566642 | 5.0416630 | 3.4418709 | 2.9037801 | 3.2572954 | 4.9505569 | 3.7935779 | 4.9905684 | 2.3822034 | 2.5683145 | 1.5257596 | 2.5658222 | 1.7679143 | 1.5471284 | 7.0302851 | 4.4717040 | 2.9332282 | 3.0518185 |
| Sci-Fi | 580 | 4.17 | 2.7900955 | 1.9852999 | 2.2179700 | 1.4032017 | 1.5226040 | 2.7674792 | 8.3710624 | 2.5597188 | 4.2696645 | 5.6192155 | 8.4344678 | 9.6668194 | 8.2730364 | 5.7962020 | 15.8368669 | 10.1614026 | 1.8724605 | 2.4054973 |
| Short | 5 | 0.04 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0059187 | 0.0075189 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.2786963 | -0.3123037 | 0.0000000 | 0.0000000 |
| Sport | 176 | 1.27 | 0.2174055 | 0.5419665 | 0.7743276 | 1.1244454 | 0.4909349 | 0.3290597 | 0.4590991 | 0.2120606 | 1.0067106 | 0.1316734 | 0.5553004 | 0.2622473 | 0.7115048 | 0.0700666 | 6.0399776 | 3.1415766 | 0.2654963 | 0.4002580 |
| Thriller | 1348 | 9.70 | 2.9990741 | 4.3714872 | 4.1243118 | 5.1675688 | 3.5709262 | 4.7007420 | 3.9046643 | 4.6536180 | 4.9463396 | 7.0516381 | 6.7165368 | 6.2413112 | 8.1795891 | 2.3495328 | 8.4273057 | 4.9689390 | 4.3291819 | 4.5934421 |
| War | 207 | 1.49 | 0.9485443 | 0.6133800 | 0.4057599 | 0.6758511 | 0.3926974 | 0.7265270 | 0.3861284 | 0.2747973 | 0.1624909 | 0.4144702 | 0.2499750 | 1.8874680 | 0.0444224 | 0.1259870 | 6.9520090 | 3.5084852 | 0.8276862 | 0.4618998 |
| Western | 93 | 0.67 | 0.0686140 | 0.2084806 | 0.1538371 | 0.0195353 | 0.0878110 | 0.0280399 | 0.0000000 | 0.2741451 | 0.2481137 | 0.4583953 | 0.2600021 | 0.0460585 | 0.6786576 | 0.0029766 | 5.8072120 | 2.8088210 | 0.0142245 | 0.1065153 |
We first draw a plot to explore the relationship between the year and the number of movies.
The plot suggests that there may be a linear relationship between log(year) and the number of movies, so we run a regression analysis:
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -0.6015224 | 0.1066876 | -5.638164 | 2e-07 |
| x | 0.0676408 | 0.0020141 | 33.584483 | 0e+00 |
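The quantities in a table like this (estimate, standard error, t value) come from an ordinary least-squares fit of one response on one predictor. The following is a minimal Python sketch of that computation on made-up data; the function name `simple_ols` and the sample values are illustrative assumptions, not the code or data behind the table.

```python
import math


def simple_ols(x, y):
    """Fit y = a + b * x by least squares.

    Returns (intercept, slope, standard error of slope, t value of slope).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return intercept, slope, se_slope, slope / se_slope


# Hypothetical data that is roughly linear.
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 7.1, 8.8]
intercept, slope, se_slope, t_value = simple_ols(x, y)
print(f"intercept={intercept:.4f} slope={slope:.4f} se={se_slope:.4f} t={t_value:.2f}")
```

A large t value, as in the table above, means the slope is many standard errors away from zero.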
First, we select all the numeric variables and remove every row that contains a missing value.
We would like to know the relationship between the year and the number of critic reviews (num_critic_for_reviews). The sample sizes for the period before 1980 are too small, and the data for the period before 1981 are too noisy, so we only consider the data for the period after 1981. In addition, we delete some sample points that are too extreme.
We then calculate how the mean of num_critic_for_reviews evolves over time.
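The cleaning and averaging steps described above can be sketched as follows. This is a minimal Python illustration on made-up records; the helper name `yearly_mean` is hypothetical, and the removal of extreme points is left out because the report does not state the criterion used.

```python
from collections import defaultdict

# Hypothetical records; None marks a missing value.
records = [
    {"title_year": 1975, "num_critic_for_reviews": 40},
    {"title_year": 1995, "num_critic_for_reviews": 50},
    {"title_year": 1995, "num_critic_for_reviews": 70},
    {"title_year": 2005, "num_critic_for_reviews": None},
    {"title_year": 2005, "num_critic_for_reviews": 120},
]


def yearly_mean(rows, value_key, first_year):
    """Drop incomplete rows, keep years after first_year, and average value_key per year."""
    complete = [r for r in rows if all(v is not None for v in r.values())]
    recent = [r for r in complete if r["title_year"] > first_year]
    groups = defaultdict(list)
    for r in recent:
        groups[r["title_year"]].append(r[value_key])
    return {year: sum(vals) / len(vals) for year, vals in sorted(groups.items())}


means = yearly_mean(records, "num_critic_for_reviews", first_year=1981)
print(means)
```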
From 2014 onward, num_critic_for_reviews decreases, which we attribute to a decline of the film industry. Because this drop is not part of the natural trend, we only consider the evolution between 1990 and 2013, which can be fitted by a quadratic function:
\[f(x)=0.85(x-1995)^2+52\]
From the above graph, we can see a decreasing trend before 1994 and an increasing trend after 1994. We use the data after 1994 to explore the linear relationship between num_critic_for_reviews and title year, to check whether our quadratic function reflects the true relationship.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -187.3435185 | 11.2706393 | -16.62226 | 0 |
| year_critic | 0.0958022 | 0.0056241 | 17.03437 | 0 |
During the period 1995-2013, the log of num_critic_for_reviews increased linearly with time (year). With y the log of num_critic_for_reviews and x the year, the relationship can be expressed as:
\[y=-187.34352+0.09580x\]
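Because the response is on a log scale, the slope can be read as a growth rate. The short Python sketch below converts the year_critic estimate from the regression table into a yearly percentage increase and a doubling time; it assumes the log is the natural logarithm, which the report does not state explicitly.

```python
import math

# Slope of log(num_critic_for_reviews) on year, taken from the regression table.
slope = 0.0958022

# Assuming a natural log, one more year multiplies the count by exp(slope).
annual_growth = math.exp(slope) - 1
doubling_time = math.log(2) / slope

print(f"{annual_growth:.1%} per year, doubling about every {doubling_time:.1f} years")
```

Under that assumption, the fit corresponds to the mean number of critic reviews growing by roughly a tenth each year over the fitted period.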
We then plot the evolution of movie_facebook_likes during the period 1986-2016.
From 2014 onward, movie_facebook_likes declines, which we attribute to a decline of the film industry. Because this drop is not part of the natural trend, we only consider the evolution between 1993 and 2013, which can be fitted by a quadratic function:
\[g(x)=200(x-2000)^2+1600\]
From the above graph, we can see a decreasing trend before 2000 and an increasing trend after 2000. We use the data after 2000 to explore the linear relationship between movie_facebook_likes and title year, to check whether our quadratic function reflects the true relationship.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -409.1178990 | 37.1779187 | -11.00432 | 0 |
| year_fblike | 0.2081514 | 0.0185148 | 11.24241 | 0 |
Over the period after 2000 (through 2013), the log of movie_facebook_likes increased linearly with time (year). With y the log of movie_facebook_likes and x the year, the relationship can be expressed as:
\[y=-409.1179+0.2082x\]
Finally, we tried to classify the movies using distance analysis. We want to assign each movie to one of two groups: IMDB score higher than 8.0, or IMDB score lower than 8.0.
The first step is to find the variables that are significantly related to IMDB score. To do so, we regress IMDB score on each candidate variable separately:
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.5596374 | 0.0550183 | 119.226496 | 0.0000000 |
| aspect_ratio | -0.0511509 | 0.0248119 | -2.061546 | 0.0393106 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.3908696 | 0.0187618 | 340.631858 | 0 |
| cast_total_facebook_likes | 0.0000059 | 0.0000009 | 6.532097 | 0 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.5021751 | 0.0197386 | 329.413639 | 0.0e+00 |
| facenumber_in_poster | -0.0372243 | 0.0080937 | -4.599151 | 4.4e-06 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 5.9972764 | 0.0248047 | 241.77964 | 0 |
| num_critic_for_reviews | 0.0030096 | 0.0001286 | 23.39832 | 0 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 4.5364386 | 0.0756187 | 59.99097 | 0 |
| duration | 0.0175496 | 0.0006788 | 25.85537 | 0 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.3342732 | 0.0170188 | 372.19182 | 0 |
| movie_facebook_likes | 0.0000146 | 0.0000008 | 18.43094 | 0 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.4004486 | 0.0166237 | 385.02028 | 0 |
| director_facebook_likes | 0.0000678 | 0.0000055 | 12.43521 | 0 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.1424274 | 0.0173876 | 353.2649 | 0 |
| num_voted_users | 0.0000034 | 0.0000001 | 33.0110 | 0 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.4069868 | 0.0180362 | 355.228330 | 0 |
| actor_1_facebook_likes | 0.0000064 | 0.0000011 | 5.801156 | 0 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.4087059 | 0.0177044 | 361.983859 | 0 |
| actor_2_facebook_likes | 0.0000241 | 0.0000039 | 6.224397 | 0 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.4251565 | 0.0176029 | 365.005857 | 0.00e+00 |
| actor_3_facebook_likes | 0.0000386 | 0.0000096 | 4.043426 | 5.36e-05 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.431449 | 0.0177846 | 361.62957 | 0.0000000 |
| budget | 0.000000 | 0.0000000 | 2.87624 | 0.0040441 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.337511 | 0.0183946 | 344.53107 | 0 |
| gross | 0.000000 | 0.0000000 | 12.78592 | 0 |
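We call a variable significant when the p-value of its regression coefficient is smaller than 0.001. Under this threshold, aspect_ratio (p = 0.039) and budget (p = 0.004) are excluded, and the remaining variables are kept.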
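The screening rule can be applied mechanically to the p-values reported in the tables above. The following minimal Python sketch does so; p-values printed as 0 in the tables are rounded, so they are entered here as 0.0, which is an approximation.

```python
# p-values of the slope coefficients, copied from the regression tables.
# Entries shown as 0 in the tables are rounded and entered as 0.0 here.
p_values = {
    "aspect_ratio": 0.0393106,
    "cast_total_facebook_likes": 0.0,
    "facenumber_in_poster": 4.4e-06,
    "num_critic_for_reviews": 0.0,
    "duration": 0.0,
    "movie_facebook_likes": 0.0,
    "director_facebook_likes": 0.0,
    "num_voted_users": 0.0,
    "actor_1_facebook_likes": 0.0,
    "actor_2_facebook_likes": 0.0,
    "actor_3_facebook_likes": 5.36e-05,
    "budget": 0.0040441,
    "gross": 0.0,
}

ALPHA = 0.001  # significance threshold stated in the text

significant = [name for name, p in p_values.items() if p < ALPHA]
dropped = [name for name in p_values if name not in significant]

print("kept:", significant)
print("dropped:", dropped)
```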
The second step is to run the distance analysis using the significant variables. We use two methods to analyze the data: one uses Euclidean distance for both inputs, and the other uses Euclidean distance and Mahalanobis distance as inputs. In each group, half of the movies form the training set and the other half form the test set. The results are as follows.
## [1] "When using both euclidean distance, the successful classification probability is 0.9307"
## [1] "When using euclidean distance and mahalanobis distance, the successful classification probability is 0.9037"
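The results show that our method has high prediction accuracy: about 93% when only Euclidean distance is used, and about 90% when Mahalanobis distance is included.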
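For illustration, a distance-based classifier of this kind can be sketched as follows. This is a minimal Python example under stated assumptions, not the procedure used for the results above: it uses two made-up features, assigns each movie to the group whose training mean is nearest, and evaluates Euclidean and Mahalanobis distance separately, because the report does not specify how the two distances are combined as inputs. All function names and data are hypothetical.

```python
def mean_vector(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def covariance_2d(points, mu):
    """Sample covariance of 2-D points, returned as (sxx, sxy, syy)."""
    n = len(points)
    sxx = sum((p[0] - mu[0]) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mu[1]) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points) / (n - 1)
    return sxx, sxy, syy


def euclidean_sq(x, mu):
    """Squared Euclidean distance between x and the group mean."""
    return (x[0] - mu[0]) ** 2 + (x[1] - mu[1]) ** 2


def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance, using the explicit inverse of a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def fit(groups):
    """Estimate mean and covariance for each labelled training group."""
    model = {}
    for label, points in groups.items():
        mu = mean_vector(points)
        model[label] = (mu, covariance_2d(points, mu))
    return model


def classify(x, model, metric):
    """Assign x to the group with the smallest distance under the chosen metric."""
    def distance(label):
        mu, cov = model[label]
        return euclidean_sq(x, mu) if metric == "euclidean" else mahalanobis_sq(x, mu, cov)
    return min(model, key=distance)


def accuracy(test_set, model, metric):
    """Fraction of (point, label) pairs classified correctly."""
    hits = sum(classify(x, model, metric) == label for x, label in test_set)
    return hits / len(test_set)


# Hypothetical training data: two features per movie, two score groups.
training = {
    "high": [(8.0, 7.5), (9.0, 8.5), (8.5, 7.0), (9.5, 8.0)],
    "low": [(2.0, 3.0), (3.0, 2.0), (2.5, 3.5), (3.5, 2.5)],
}
testing = [((8.8, 7.9), "high"), ((2.4, 3.1), "low")]

model = fit(training)
acc_euclidean = accuracy(testing, model, "euclidean")
acc_mahalanobis = accuracy(testing, model, "mahalanobis")
print(acc_euclidean, acc_mahalanobis)
```

Mahalanobis distance rescales each direction by the group's own covariance, so the two metrics can disagree when the features are correlated or measured on very different scales.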